Operationalizing available research computing resources for stock assessment

Nicholas Ducharme-Barth and Megumi Oshima

2024-11-05

What?


Research computing is the collection of computing, software, and storage resources and services that allows for data analysis at scale.

In our particular case we are interested in leveraging research computing to augment stock assessment workflows.

Run more/bigger models in less time

Why?


Improve efficiency by running tens to thousands of models ‘simultaneously’.

2021 Southwest Pacific Ocean swordfish stock assessment

9,300 model runs totalling ~46 months of computation time.
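
To put those figures in perspective, the short calculation below converts the total into an average per-run cost and an idealized wall-clock time. It is a rough sketch: the 30-day month and the 1,000 concurrent job slots are assumptions chosen for illustration, not figures from the assessment, and perfect scaling is assumed.

```shell
# Figures quoted above
total_months=46
n_runs=9300

# Assumptions for illustration only
hours_per_month=720   # 30 days x 24 hours
n_slots=1000          # hypothetical number of jobs running at once

total_hours=$((total_months * hours_per_month))

# Average compute time per model run
avg_hours_per_run=$(awk -v t="$total_hours" -v n="$n_runs" 'BEGIN { printf "%.1f", t / n }')

# Wall-clock time if the work were spread evenly over n_slots (perfect scaling)
ideal_wall_hours=$(awk -v t="$total_hours" -v s="$n_slots" 'BEGIN { printf "%.1f", t / s }')

echo "total compute:   ${total_hours} h"
echo "average per run: ${avg_hours_per_run} h"
echo "ideal wall time: ${ideal_wall_hours} h on ${n_slots} slots"
```

Under these assumptions each run is only a few hours long, which is the kind of short, independent job that spreads well across many machines.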

Why?

  • Efficiency
  • Knowledge acquisition
  • Automation, transparency, reproducibility & portability
  • Multi-model inference

Software containers

Better science

How?

High-throughput computing (HTC)

  • Set-up to handle running many jobs simultaneously
  • Ideal for running short, small, independent (embarrassingly parallel) jobs.
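
As a toy illustration of what ‘embarrassingly parallel’ means, the sketch below launches eight independent jobs at once on a single machine and waits for all of them. The function run_model is a hypothetical stand-in for a real model run; an HTC system applies the same pattern across many machines rather than across background processes.

```shell
# Scratch directory to collect one output file per job
workdir="$(mktemp -d)"

# Hypothetical stand-in for a model run: each job needs only its own ID
# and writes only its own output, so no job depends on any other
run_model() {
    id="$1"
    echo "result for model ${id}" > "${workdir}/model_${id}.out"
}

# Launch all jobs in the background, then wait for every one to finish
for id in 1 2 3 4 5 6 7 8; do
    run_model "$id" &
done
wait

n_done="$(ls "${workdir}" | wc -l | tr -d ' ')"
echo "completed ${n_done} independent runs"
```

Because the jobs share nothing, they can finish in any order and the result is the same.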

High-performance computing (HPC)

  • Can handle HTC workflows (in theory)
  • Can also handle long-running, large, multi-processor jobs (true parallel processing)

2024 North Pacific shortfin mako shark assessment: Used HTC resources to complete ~4 months of computations (18,000 simulation-estimation model runs) in ~3 hours (1027x faster) during a working group meeting.

Example: Fitting a large spatiotemporal model in R using TMB required 128 CPUs & 1 TB of RAM.

Photo credit: NOAA

OpenScienceGrid (OSG): OSPool

How?

OpenScienceGrid (OSG)

  • Uses HTCondor on a distributed computing network (no shared file system between compute nodes) to implement HTC workflows
  • Free to use for US-based researchers who are affiliated with an academic or government organization and use OSG for research or education
  • Should not be used to analyze protected data
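
Because there is no shared file system, each job has to list the files it needs so that HTCondor can ship them to whichever node runs it. The sketch below writes a minimal HTCondor submit file from the shell. The file names (run_model.sh, model_inputs.tar.gz, ss3_runs.sub), the resource requests, and the job count are hypothetical placeholders, and the site-specific setting that attaches a container image is deliberately left out.

```shell
submit_file="ss3_runs.sub"

# Quoted heredoc so that $(Process) is written literally for HTCondor to expand
cat > "$submit_file" <<'EOF'
# Script each job runs; $(Process) is the job index (0, 1, 2, ...)
executable              = run_model.sh
arguments               = $(Process)

# No shared file system: inputs are transferred to the node with the job
transfer_input_files    = model_inputs.tar.gz
should_transfer_files   = YES
when_to_transfer_output = ON_EXIT

# Per-job logs
output                  = job_$(Process).out
error                   = job_$(Process).err
log                     = jobs.log

# Resources requested per job
request_cpus            = 1
request_memory          = 2GB
request_disk            = 4GB

# Number of independent jobs to start
queue 10
EOF

echo "wrote ${submit_file}"
```

On a submit node the file would then be passed to condor_submit, for example `condor_submit ss3_runs.sub`.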

NOAA Hera

  • Uses Slurm to schedule HPC (or HTC) workflows
  • Shared file system between compute nodes
  • NOAA resource, so there are no restrictions on acceptable use or on analyzing protected data when working on mission-related tasks
  • Allocation determines access
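
One common way to run an HTC-style workflow under Slurm is a job array, sketched below as a batch script. The account name, array size, time limit, the script run_model.R, and the runs/ directory layout are all hypothetical, and the sketch assumes the apptainer command is available on the compute nodes. The shared file system means every task can read the same container image and write to its own directory without any file transfer step.

```shell
#!/bin/bash
# Hypothetical job name, account, array size, and time limit
#SBATCH --job-name=ss3_array
#SBATCH --account=my_project
#SBATCH --array=1-100
#SBATCH --ntasks=1
#SBATCH --time=01:00:00
# %A is the array job ID, %a is the array task index
#SBATCH --output=run_%A_%a.out

image="linux-r4ss-v4.sif"

# Slurm sets SLURM_ARRAY_TASK_ID for each array task; default to 1 so the
# script can also be dry-run outside of Slurm
task_id="${SLURM_ARRAY_TASK_ID:-1}"
run_dir="runs/model_${task_id}"

cmd="apptainer exec ${image} Rscript run_model.R ${run_dir}"
echo "task ${task_id}: ${cmd}"

# Only launch the model when actually running as a Slurm array task
if [ -n "${SLURM_ARRAY_TASK_ID:-}" ]; then
    apptainer exec "${image}" Rscript run_model.R "${run_dir}"
fi
```

Saved as, say, run_array.sh (a hypothetical name), the script would be submitted with `sbatch run_array.sh`.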

Both use software containers

Software containers


Many may already be using containers, such as GitHub Codespaces or Posit Workbench, in existing cloud-based workflows

  • Application: set up identical, custom software environments on OSG and Hera
  • Application: use to “version” analyses by “freezing” packages/libraries

Software containers

Apptainer

  • Secure, portable and reproducible software container for Linux operating systems
  • Easy to use
  • Doesn’t require root privileges to build
  • Plays nice with existing containers (e.g., Docker)

Apptainer

Let’s look at an example (linux-r4ss-v4.def):

Bootstrap: docker
From: ubuntu:20.04

%post
    TZ=Etc/UTC && \
    ln -snf /usr/share/zoneinfo/$TZ /etc/localtime && \
    echo $TZ > /etc/timezone
    apt-get update -y
    apt-get install -y \
        tzdata \
        curl \
        dos2unix

    apt-get update -y
    apt-get install -y \
            build-essential \
            cmake \
            g++ \
            libssl-dev \
            libssh2-1-dev \
            libcurl4-openssl-dev \
            libfontconfig1-dev \
            libxml2-dev \
            libgit2-dev \
            wget \
            tar \
            coreutils \
            gzip \
            findutils \
            sed \
            gdebi-core \
            locales \
            nano
    
    locale-gen en_US.UTF-8

    export R_VERSION=4.4.0
    curl -O https://cdn.rstudio.com/r/ubuntu-2004/pkgs/r-${R_VERSION}_1_amd64.deb
    gdebi -n r-${R_VERSION}_1_amd64.deb

    ln -s /opt/R/${R_VERSION}/bin/R /usr/local/bin/R
    ln -s /opt/R/${R_VERSION}/bin/Rscript /usr/local/bin/Rscript

    R -e "install.packages('remotes', dependencies=TRUE, repos='https://cran.rstudio.com/')"
    R -e "install.packages('data.table', dependencies=TRUE, repos='https://cran.rstudio.com/')"
    R -e "install.packages('magrittr', dependencies=TRUE, repos='https://cran.rstudio.com/')"
    R -e "install.packages('mvtnorm', dependencies=TRUE, repos='https://cran.rstudio.com/')"
    R -e "remotes::install_github('r4ss/r4ss')"
    R -e "remotes::install_github('PIFSCstockassessments/ss3diags')"

    # Double quotes so that the build date is expanded now, at build time
    NOW="$(date)"
    echo "export build_date=\"${NOW}\"" >> $SINGULARITY_ENVIRONMENT

    mkdir -p /ss_exe
    curl -L -o /ss_exe/ss3_linux https://github.com/nmfs-ost/ss3-source-code/releases/download/v3.30.22.1/ss3_linux
    chmod 755 /ss_exe/ss3_linux

%environment
    export PATH=/ss_exe:$PATH
    
%labels
    Author nicholas.ducharme-barth@noaa.gov
    Version v0.0.4

%help
    This is a Linux (Ubuntu 20.04) container containing Stock Synthesis (version 3.30.22.1), R (version 4.4.0) and the R packages r4ss, ss3diags, data.table, magrittr, and mvtnorm.

Build on a Linux system with Apptainer installed:

apptainer build linux-r4ss-v4.sif linux-r4ss-v4.def
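
Once the image is built, a few quick checks can confirm that it contains what the definition file promises. The sketch below assumes linux-r4ss-v4.sif is in the current directory and skips the checks if Apptainer or the image is not found.

```shell
image="linux-r4ss-v4.sif"

if command -v apptainer >/dev/null 2>&1 && [ -f "$image" ]; then
    # Print the %help text and the %labels recorded at build time
    apptainer run-help "$image"
    apptainer inspect --labels "$image"

    # Confirm that R and r4ss are available inside the container
    apptainer exec "$image" Rscript -e 'packageVersion("r4ss")'

    # Confirm that the Stock Synthesis executable is on the PATH
    apptainer exec "$image" sh -c 'command -v ss3_linux'
    status="checked"
else
    echo "apptainer or ${image} not found; skipping checks"
    status="skipped"
fi
```

The same .sif file can then be copied to OSG or Hera so that every job runs in an identical software environment.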